11 research outputs found

    Brief Announcement: Optimal Bit-Reversal Using Vector Permutations

    No full text

    GPUVerify: A Verifier for GPU Kernels

    We present a technique for verifying race- and divergence-freedom of GPU kernels that are written in mainstream kernel programming languages such as OpenCL and CUDA. Our approach is founded on a novel formal operational semantics for GPU programming termed synchronous, delayed visibility (SDV) semantics. The SDV semantics provides a precise definition of barrier divergence in GPU kernels and allows kernel verification to be reduced to analysis of a sequential program, thereby completely avoiding the need to reason about thread interleavings, and allowing existing modular techniques for program verification to be leveraged. We describe an efficient encoding for data race detection and propose a method for automatically inferring loop invariants required for verification. We have implemented these techniques as a practical verification tool, GPUVerify, which can be applied directly to OpenCL and CUDA source code. We evaluate GPUVerify with respect to a set of 163 kernels drawn from public and commercial sources. Our evaluation demonstrates that GPUVerify is capable of efficient, automatic verification of a large number of real-world kernels.

    Set-Congruence Dynamic Analysis for Thread-Level Speculation (TLS)

    No full text

    Extending a C-like language for portable SIMD programming

    SIMD instructions have been common in CPUs for years now. Using these instructions effectively requires not only vectorization of code, but also modifications to the data layout. However, automatic vectorization techniques are often not powerful enough and suffer from a restricted scope of applicability; hence, programmers often vectorize their programs manually by using intrinsics: compiler-known functions that directly expand to machine instructions. Intrinsics significantly decrease programmer productivity by enforcing a very error-prone and hard-to-read assembly-like programming style. Furthermore, intrinsics are not portable because they are tied to a specific instruction set. In this paper, we show how a C-like language can be extended to allow for portable and efficient SIMD programming. Our extension puts the programmer in total control over where and how control-flow vectorization is triggered. We present a type system and a formal semantics of our extension and prove the soundness of the type system. Using our prototype implementation IVL, which targets Intel's MIC architecture and the SSE instruction set, we show that the generated code is roughly on par with handwritten intrinsic code.

    Essay Topics

    © 2015 IEEE. Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.